[HUDI-5533] Support spark columns comments #8683
base: master
Conversation
...datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
Scala Option does not have ofNullable (Java Optional does have). BTW
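A quick sketch of the point above (the nullable string is just an illustration): Scala's Option(...) plays the role of Java's Optional.ofNullable(...), returning None / empty for a null input.

val doc: String = null
val scalaOpt: Option[String] = Option(doc)                                    // None when doc is null
val javaOpt: java.util.Optional[String] = java.util.Optional.ofNullable(doc)  // Optional.empty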
case INT => avroSchema.getLogicalType match {
  case _: Date => SchemaType(DateType, nullable = false)
  case _ => SchemaType(IntegerType, nullable = false)
(avroSchema.getType, Option(avroSchema.getDoc)) match {
The conversion tool is copied from Spark: https://github.com/apache/spark/blob/dd4db21cb69a9a9c3715360673a76e6f150303d4/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L58. I just noticed that Spark also does not support keeping comments from Avro fields while doing the conversion.
Spark likely has this limitation as well when retrieving a schema from Avro. But Spark doesn't usually infer its schema from Avro; Hudi does, and that's the reason for this patch.
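For illustration, a minimal sketch of the limitation, assuming the public spark-avro SchemaConverters API and a made-up record schema (this is not code from this PR):

import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// a made-up Avro record whose field carries a doc
val avro = new Schema.Parser().parse(
  """{"type":"record","name":"rec","fields":[
    |  {"name":"id","type":"int","doc":"primary key"}
    |]}""".stripMargin)

val struct = SchemaConverters.toSqlType(avro).dataType.asInstanceOf[StructType]
println(avro.getField("id").doc())   // "primary key"
println(struct("id").getComment())   // None: the stock conversion drops the doc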
Can you write a test case for it, especially for creating a table?
sure
@danny0405 added a test
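For reference, a hedged sketch of the kind of test discussed above (table name, properties, and assertions are illustrative, not the PR's actual test; it assumes a SparkSession with the Hudi SQL extensions enabled):

spark.sql(
  """create table hudi_comment_test (
    |  id int comment 'primary key',
    |  name string comment 'user name',
    |  ts bigint
    |) using hudi
    |tblproperties (primaryKey = 'id', preCombineField = 'ts')""".stripMargin)

// the column comments should show up when describing the table
val comments = spark.sql("describe table hudi_comment_test")
  .select("col_name", "comment")
  .collect()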
Kind of feel there is no need to change each match for every data type. Can we write another method, similar to toSqlTypeHelper, which invokes toSqlTypeHelper first and then fixes the comment separately?
"Kind of feel there is no need to change each match for every data type"
Any column, including nested columns, may have comments, so I don't see why we shouldn't look through all the Avro content for doc fields.
"which invokes toSqlTypeHelper first and then fixes the comment separately"
This would lead to walking through the Avro schema twice, and also to a complex merge of the results. Am I missing something?
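As an illustration of the single-traversal idea, a simplified sketch (not the PR's actual SchemaConverters change; it skips unions, logical types, nullability, etc.):

import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.Schema.Type._
import org.apache.spark.sql.types._

// convert one Avro field and attach its doc in the same pass,
// recursing into nested records so their comments are kept too
def toStructField(field: Schema.Field): StructField = {
  val dataType = field.schema().getType match {
    case INT     => IntegerType
    case LONG    => LongType
    case BOOLEAN => BooleanType
    case STRING  => StringType
    case RECORD  => StructType(field.schema().getFields.asScala.map(toStructField).toArray)
    case _       => StringType // simplified fallback
  }
  val base = StructField(field.name(), dataType, nullable = true)
  Option(field.doc()).map(base.withComment).getOrElse(base)
}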
…umns
# Conflicts:
# hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
I have investigated a bit, and here is my current understanding: reading a Hudi table with Spark has two paths:
So right now, using this PR and setting both
To fix this we could:
I would go for
Thoughts @danny0405 @bhasudha @yihua ?
@danny0405 The PR is ready. The way to activate comments for all engines is as below:
To recap:
BTW, I will provide some details about comments in the doc if you want. The reason the Spark datasource sync should be disabled is that our current table property builder relies on Parquet types, which do not know about comments. It would be painful to modify, and the Spark table properties are useless and lead to errors.
@danny0405 just learned here why the Spark data source stuff shall be kept within HMS.
Then there is a need to update the datasource code to be aware of comments within Hive sync. @prashantwason can you confirm this part is still applicable and as a result
@danny0405 implemented Hive sync datasource, so to clarify what this PR brings:
#4960 provided:
#8740 provided:
Not covered yet:
Thanks @parisni, will find some time for the review ~
@@ -212,8 +213,9 @@ private void syncSchema(String tableName, boolean tableExists, boolean useRealTi
    Map<String, String> tableProperties = ConfigUtils.toMap(config.getString(ADB_SYNC_TABLE_PROPERTIES));
    Map<String, String> serdeProperties = ConfigUtils.toMap(config.getString(ADB_SYNC_SERDE_PROPERTIES));
    if (config.getBoolean(ADB_SYNC_SYNC_AS_SPARK_DATA_SOURCE_TABLE)) {
      List<FieldSchema> fromStorage = syncClient.getStorageFieldSchemas();
      Map<String, String> sparkTableProperties = SparkDataSourceTableUtils.getSparkTableProperties(config.getSplitStrings(META_SYNC_PARTITION_FIELDS),
fromStorage -> fieldSchema
@@ -65,7 +65,7 @@ public void runSQL(String s) {
    try {
      stmt = connection.createStatement();
      LOG.info("Executing SQL " + s);
      stmt.execute(s);
      stmt.execute(escapeAntiSlash(s));
Is the escape related to this change?
Yes, both JDBC and HMS generate SQL, and the backslash should be double-escaped, otherwise it is lost.
The backslash is used to escape double quotes in the comments DDL.
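A small Scala sketch of that double-escaping, mirroring what the escapeQuote / escapeAntiSlash helpers in this diff do (the comment value is illustrative):

import java.util.regex.Matcher

val comment = "size in \"bytes\""  // a column comment containing a double quote

// escape the quote for the DDL string: size in \"bytes\"
val quoted = comment.replaceAll("\"", Matcher.quoteReplacement("\\\""))

// double every backslash so it survives executors that unescape the statement: size in \\"bytes\\"
val escaped = quoted.replaceAll("\\\\", Matcher.quoteReplacement("\\\\"))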
/**
 * SQL statement should be escaped in order to consider anti-slash
 *
Every sentence should end with a period. Every new paragraph should start with <p>. Remove the param comments if there are no comments at all.
private static String getComment(String name, List<FieldSchema> fromStorage) {
  return fromStorage.stream()
      .filter(f -> name.equals(f.getName()))
      .filter(f -> f.getComment().isPresent())
So we only match the fields by name at the first level.
Yes, that's the current limitation of comment support: it's a flattened list of fields.
private static String escapeQuote(String s) {
  return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
}
Not sure why we need an escape.
See the UT. I added a double quote in the column comment, and for some Hive sync DDL executors, this escaping is needed.
@@ -133,7 +154,7 @@ private static String convertPrimitiveType(PrimitiveType field) {

private static String convertGroupField(GroupType field) {
  if (field.getOriginalType() == null) {
    return convertToSparkSchemaJson(field);
    return convertToSparkSchemaJson(field, Arrays.asList());
  }
Does this mean the composite data type is not supported?
Yes, currently in Hive sync we don't have the composite data type; the List<FieldSchema> is flat.
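To make that limitation concrete, a small sketch with a hypothetical stand-in for Hudi's FieldSchema (names and types are illustrative, not actual code from the sync module):

// each column is one flat (name, type, comment) entry
case class FieldSchema(name: String, hiveType: String, comment: Option[String])

val fields = Seq(
  FieldSchema("id", "int", Some("primary key")),
  // the struct is a single entry, so comments on address.city / address.zip cannot be carried
  FieldSchema("address", "struct<city:string,zip:string>", None)
)

// name-based, first-level-only lookup, like the getComment helper above
def getComment(name: String, fields: Seq[FieldSchema]): Option[String] =
  fields.find(_.name == name).flatMap(_.comment)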
      messageType);
      messageType,
      // flink does not support comment yet
      Arrays.asList());
Collections.emptyList() ?
Change Logs
Fixes #7531, i.e. show comments within Spark schemas.
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low medium or high below)
None
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make changes to the website.
Contributor's checklist